Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization
We study the conditions under which one is able to efficiently apply
variance-reduction and acceleration schemes on finite sum optimization
problems. First, we show that, perhaps surprisingly, the finite sum structure
by itself is not sufficient for obtaining a complexity bound of
$\tilde{\mathcal{O}}((n+L/\mu)\ln(1/\epsilon))$ for $L$-smooth and $\mu$-strongly
convex individual functions - one must also know which individual function is
being referred to by the oracle at each iteration. Next, we show that for a
broad class of first-order and coordinate-descent finite sum algorithms
(including, e.g., SDCA, SVRG, SAG), it is not possible to get an 'accelerated'
complexity bound of $\tilde{\mathcal{O}}((n+\sqrt{nL/\mu})\ln(1/\epsilon))$, unless
the strong convexity parameter is given explicitly. Lastly, we show that when
this class of algorithms is used for minimizing $L$-smooth and convex finite
sums, the optimal complexity bound is $\tilde{\mathcal{O}}(n+L/\epsilon)$, assuming
that (on average) the same update rule is used in every iteration, and
$\tilde{\mathcal{O}}(n+\sqrt{nL/\epsilon})$ otherwise.
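For context, the finite sum setting referred to throughout these abstracts takes the standard form (a generic formulation, not quoted from the paper)
\[
  \min_{x \in \mathbb{R}^d} \; F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),
\]
where each individual function $f_i$ is $L$-smooth and $\mu$-strongly convex. The two rates contrasted above are then the unaccelerated rate $\tilde{\mathcal{O}}((n+L/\mu)\ln(1/\epsilon))$ and the accelerated rate $\tilde{\mathcal{O}}((n+\sqrt{nL/\mu})\ln(1/\epsilon))$.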
Dimension-Free Iteration Complexity of Finite Sum Optimization Problems
Many canonical machine learning problems boil down to a convex optimization
problem with a finite sum structure. However, whereas much progress has been
made in developing faster algorithms for this setting, the inherent limitations
of these problems are not satisfactorily addressed by existing lower bounds.
Indeed, current bounds focus on first-order optimization algorithms, and only
apply in the often unrealistic regime where the number of iterations is less
than $\mathcal{O}(d/n)$ (where $d$ is the dimension and $n$ is the number of
samples). In this work, we extend the framework of (Arjevani et al., 2015) to
provide new lower bounds, which are dimension-free, and go beyond the
assumptions of current bounds, thereby covering standard finite sum
optimization methods, e.g., SAG, SAGA, SVRG, SDCA without duality, as well as
stochastic coordinate-descent methods, such as SDCA and accelerated proximal
SDCA.
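As a concrete instance of the incremental methods covered by these bounds, here is a minimal sketch of an SVRG epoch in its textbook form; the function names, step size, and sampling scheme are illustrative and not taken from the paper.

    import numpy as np

    def svrg_epoch(grad_i, w_snapshot, w, n, step_size, inner_iters, rng):
        # Full gradient of the finite sum at the snapshot point.
        full_grad = sum(grad_i(i, w_snapshot) for i in range(n)) / n
        for _ in range(inner_iters):
            i = rng.integers(n)  # the method knows which index it queried
            # Variance-reduced stochastic gradient estimate.
            g = grad_i(i, w) - grad_i(i, w_snapshot) + full_grad
            w = w - step_size * g
        return w

    # Illustrative usage on a least-squares finite sum, f_i(w) = 0.5 * (A[i] @ w - b[i]) ** 2.
    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
    grad_i = lambda i, w: (A[i] @ w - b[i]) * A[i]
    w = np.zeros(5)
    for _ in range(20):
        w = svrg_epoch(grad_i, w, w, n=100, step_size=0.01, inner_iters=200, rng=rng)

Note that each inner step evaluates $\nabla f_i$ at a known index $i$; this is exactly the information that the stochastic-oracle lower bounds discussed in these papers withhold.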
Communication Complexity of Distributed Convex Learning and Optimization
We study the fundamental limits to communication-efficient distributed
methods for convex learning and optimization, under different assumptions on
the information available to individual machines, and the types of functions
considered. We identify cases where existing algorithms are already worst-case
optimal, as well as cases where room for further improvement is still possible.
Among other things, our results indicate that without similarity between the
local objective functions (due to statistical data similarity or otherwise),
many communication rounds may be required, even if the machines have unbounded
computational power.
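As a rough illustration of the setup (standard in this literature; the notation is ours, not the paper's), $m$ machines hold local objectives $F_1,\dots,F_m$ and cooperate over communication rounds to solve
\[
  \min_{x} \; F(x) = \frac{1}{m}\sum_{j=1}^{m} F_j(x),
\]
where each round allows only a limited exchange of information between machines. The 'similarity' referred to above means, e.g., that the local objectives have close gradients or Hessians, as happens when each $F_j$ is an empirical average over data drawn from the same distribution.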
On Lower and Upper Bounds in Smooth Strongly Convex Optimization - A Unified Approach via Linear Iterative Methods
In this thesis we develop a novel framework to study smooth and strongly
convex optimization algorithms, both deterministic and stochastic. Focusing on
quadratic functions we are able to examine optimization algorithms as a
recursive application of linear operators. This, in turn, reveals a powerful
connection between a class of optimization algorithms and the analytic theory
of polynomials whereby new lower and upper bounds are derived. In particular,
we present a new and natural derivation of Nesterov's well-known Accelerated
Gradient Descent method by employing simple 'economic' polynomials. This rather
natural interpretation of AGD contrasts with earlier ones which lacked a
simple, yet solid, motivation. Lastly, whereas existing lower bounds are only
valid when the dimensionality scales with the number of iterations, our lower
bound holds in the natural regime where the dimensionality is fixed.
Comment: A related paper co-authored with Shai Shalev-Shwartz and Ohad Shamir is to be published soon.
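A standard way to see the polynomial connection described above (a sketch in our notation, not the thesis's exact derivation): for a quadratic $f(x) = \tfrac{1}{2}x^\top A x - b^\top x$ with minimizer $x^\star$, any method whose iterates are fixed linear combinations of past gradients satisfies
\[
  x_k - x^\star = p_k(A)\,(x_0 - x^\star),
\]
for some degree-$k$ polynomial $p_k$ with $p_k(0) = 1$; e.g., gradient descent with step size $\eta$ gives $p_k(\lambda) = (1 - \eta\lambda)^k$. Upper and lower bounds then reduce to asking how small such a polynomial can be made on the spectrum $[\mu, L]$.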
Oracle Complexity of Second-Order Methods for Smooth Convex Optimization
Second-order methods, which utilize gradients as well as Hessians to optimize
a given function, are of major importance in mathematical optimization. In this
work, we prove tight bounds on the oracle complexity of such methods for smooth
convex functions, or equivalently, the worst-case number of iterations required
to optimize such functions to a given accuracy. In particular, these bounds
indicate when such methods can or cannot improve on gradient-based methods,
whose oracle complexity is much better understood. We also provide
generalizations of our results to higher-order methods.
Comment: 35 pages; added discussion of matching upper bounds, and generalization to higher-order methods.
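For reference (standard definitions, not specific to this paper), a second-order oracle queried at $x$ returns $(f(x), \nabla f(x), \nabla^2 f(x))$, and the prototypical update built from it is the damped Newton step
\[
  x_{k+1} = x_k - \eta_k\,[\nabla^2 f(x_k)]^{-1}\nabla f(x_k);
\]
oracle complexity then counts how many such queries are needed to reach an $\epsilon$-accurate minimizer of a smooth convex $f$.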
A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates
We provide tight finite-time convergence bounds for gradient descent and
stochastic gradient descent on quadratic functions, when the gradients are
delayed and reflect iterates from $\tau$ rounds ago. First, we show that
without stochastic noise, delays strongly affect the attainable optimization
error: in fact, the error can be as bad as that of non-delayed gradient descent
run on only $1/\tau$ of the gradients. In sharp contrast, we quantify how stochastic
noise makes the effect of delays negligible, improving on previous work which
only showed this phenomenon asymptotically or for much smaller delays. Also, in
the context of distributed optimization, the results indicate that the
performance of gradient descent with delays is competitive with synchronous
approaches such as mini-batching. Our results are based on a novel technique
for analyzing convergence of optimization algorithms using generating
functions.
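A minimal sketch of the delayed-update scheme analyzed here (our own illustrative code; the noise model and parameter names are assumptions, not the paper's construction):

    import numpy as np
    from collections import deque

    def delayed_sgd(grad, w0, step_size, delay, num_rounds, noise_std=0.0, seed=0):
        # At round t the update applies the gradient evaluated at the iterate
        # produced `delay` rounds earlier, optionally perturbed by Gaussian noise.
        rng = np.random.default_rng(seed)
        w = np.asarray(w0, dtype=float)
        stale = deque([w.copy()] * (delay + 1), maxlen=delay + 1)
        for _ in range(num_rounds):
            g = grad(stale[0])                     # gradient at the stale iterate
            if noise_std > 0.0:
                g = g + noise_std * rng.normal(size=np.shape(g))
            w = w - step_size * g
            stale.append(w.copy())                 # the oldest iterate drops out
        return w

    # Illustrative usage on a quadratic, f(w) = 0.5 * w @ H @ w - h @ w.
    H = np.diag([1.0, 10.0])
    h = np.array([1.0, 1.0])
    w_final = delayed_sgd(lambda w: H @ w - h, w0=np.zeros(2), step_size=0.01,
                          delay=5, num_rounds=2000, noise_std=0.1)

With noise_std = 0 this is plain gradient descent on stale gradients; the abstract's contrast is that adding stochastic noise makes the effect of the delay on the final error negligible.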
Symmetry & critical points for a model shallow neural network
We consider the optimization problem associated with fitting two-layer ReLU
networks with $k$ hidden neurons, where labels are assumed to be generated by a
(teacher) neural network. We leverage the rich symmetry exhibited by such
models to identify various families of critical points and express them as
power series in $k^{-1/2}$. These expressions are then used to derive
estimates for several related quantities which imply that not all spurious
minima are alike. In particular, we show that while the loss function at
certain types of spurious minima decays to zero like $\mathcal{O}(k^{-1})$, in other cases
the loss converges to a strictly positive constant. The methods used depend on
symmetry, the geometry of group actions, bifurcation, and Artin's implicit
function theorem.
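The teacher-student objective underlying this and the following two abstracts has the generic form (up to normalization conventions; a sketch, not quoted from the papers)
\[
  \mathcal{L}(W) = \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\Big[\Big(\sum_{i=1}^{k}\sigma(w_i^\top x) - \sum_{j=1}^{k}\sigma(v_j^\top x)\Big)^{2}\Big],
\]
where $\sigma(t) = \max\{t, 0\}$ is the ReLU, $W = (w_1,\dots,w_k)$ are the trainable (student) weights and $(v_1,\dots,v_k)$ the fixed target (teacher) weights. The loss is invariant under permutations of the student neurons and, for isotropic Gaussian inputs, under simultaneous orthogonal transformations of all weight vectors; this is the symmetry the analysis exploits.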
On the Principle of Least Symmetry Breaking in Shallow ReLU Models
We consider the optimization problem associated with fitting two-layer ReLU
networks with respect to the squared loss, where labels are assumed to be
generated by a target network. Focusing first on standard Gaussian inputs, we
show that the structure of spurious local minima detected by stochastic
gradient descent (SGD) is, in a well-defined sense, the \emph{least loss of
symmetry} with respect to the target weights. A closer look at the analysis
indicates that this principle of least symmetry breaking may apply to a broader
range of settings. Motivated by this, we conduct a series of experiments which
corroborate this hypothesis for different classes of non-isotropic non-product
distributions, smooth activation functions, and networks with a few layers.
Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale of Symmetry
We consider the optimization problem associated with fitting two-layer ReLU
networks with $k$ neurons. We leverage the rich symmetry structure to
analytically characterize the Hessian and its spectral density at various
families of spurious local minima. In particular, we prove that for standard
$d$-dimensional Gaussian inputs with $d \ge k$: (a) of the $dk$ eigenvalues
corresponding to the weights of the first layer, $dk - \mathcal{O}(d)$ concentrate near
zero, (b) $\mathcal{O}(d)$ of the remaining eigenvalues grow linearly with $k$.
Although this phenomenon of extremely skewed spectrum has been observed many
times before, to the best of our knowledge, this is the first time it has been
established rigorously. Our analytic approach uses techniques, new to the
field, from symmetry breaking and representation theory, and carries important
implications for our ability to argue about statistical generalization through
local curvature.
On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions
Recent advances in randomized incremental methods for minimizing $L$-smooth
$\mu$-strongly convex finite sums have culminated in tight complexity of
$\tilde{\mathcal{O}}((n+\sqrt{nL/\mu})\ln(1/\epsilon))$ and $\tilde{\mathcal{O}}(n+\sqrt{nL/\epsilon})$,
where $\mu>0$ and $\mu=0$, respectively, and $n$ denotes the number of
individual functions. Unlike incremental methods, stochastic methods for finite
sums do not rely on an explicit knowledge of which individual function is being
addressed at each iteration, and as such, must perform at least $\tilde{\Omega}(n\ln(1/\epsilon))$
iterations to obtain $\epsilon$-optimal solutions. In this work, we exploit the
finite noise structure of finite sums to derive a matching $\tilde{\mathcal{O}}(n\ln(1/\epsilon))$ upper bound
under the global oracle model, showing that this lower bound is indeed tight.
Following a similar approach, we propose a novel adaptation of SVRG which is
both \emph{compatible with stochastic oracles} and achieves complexity bounds
for both the strongly convex ($\mu>0$) and the convex ($\mu=0$) settings. Our
bounds hold w.h.p. and match in part existing lower bounds for these settings.
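Stated informally (our paraphrase of the oracle models involved, not the paper's formal definitions), the distinction driving this line of work is
\[
  \text{incremental: } x \mapsto (i, \nabla f_i(x)), \qquad \text{stochastic: } x \mapsto \nabla f_I(x),\; I \sim \mathrm{Unif}\{1,\dots,n\} \text{ hidden},
\]
i.e., an incremental oracle reveals (or lets the algorithm choose) the index of the individual function it touches, whereas a stochastic oracle never discloses it; the global oracle model mentioned above is stronger in that a single query reveals an entire individual function rather than local first-order information.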